Libraries and whatnot

library(tidyverse)
# library(broom)
library(patchwork)
library(scales)
library(dagitty)
library(ggdag)
library(latex2exp)  # Easily convert LaTeX into arcane plotmath expressions
library(ggtext)     # Use markdown in ggplot labels

# Create a cleaner serifed theme to use throughout
theme_do_calc <- function() {
  theme_dag(base_family = "Times New Roman") +
    theme(plot.title = element_text(size = rel(1.5)),
          plot.subtitle = element_markdown())
}

# Make all geom_dag_text() layers use these settings automatically
update_geom_defaults(ggdag:::GeomDagText, list(family = "Times New Roman", 
                                               fontface = "bold",
                                               color = "black"))

Examples

In all the examples that follow, we are trying to estimate the causal relationship between x and y, i.e. the interventional distribution \(P(Y \mid \mbox{do}(X = x))\)

Chains (mediation)

This represents a sequential path from intervention to outcome via a mediator (x → z → y). Controlling for the mediator can bias the estimated causal link between x and y
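A quick simulation, in the same style as the fork examples below, makes the point concrete (the coefficients 5 and 2 and the seed are arbitrary choices for illustration): conditioning on the mediator blocks the very path we want to measure, so the x coefficient collapses toward zero.

```r
# Chain: x -> z -> y, with arbitrary coefficients 5 and 2
set.seed(1234)
n <- 1000
x <- rnorm(n, 10, 5)
z <- x * 5 + rnorm(n, 0, 2)  # mediator
y <- z * 2 + rnorm(n, 0, 2)

# Total effect of x on y flows through z: roughly 5 * 2 = 10
coef(lm(y ~ x))["x"]

# Conditioning on the mediator blocks the causal path,
# so the x coefficient collapses toward zero
coef(lm(y ~ x + z))["x"]
```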

Forks (common cause)

These occur when a confounding variable affects both the intervention (x) and the outcome (y). They can produce a spurious correlation (i.e. x and y will be correlated even in the absence of any causal link)

Simple fork

This has no causal link between x and y, and a single confounder (z)

fork_dag <- dagify(x ~ z,
                   y ~ z,
                   coords = list(x = c(x = 1, y = 3, z = 2),
                                 y = c(x = 1, y = 1, z = 2)),
                   exposure = "x",
                   outcome = "y")

ggdag(fork_dag) +
  theme_dag()

Independence: x and y are independent (conditioned on z)

impliedConditionalIndependencies(fork_dag)
## x _||_ y | z

Adjustment set: shows which variables must be adjusted for to identify the causal effect of the exposure on the outcome

ggdag_adjustment_set(fork_dag) +
  theme_dag()

Simulated data - Equations

  • z: \(N(10, 5)\)
  • x ~ z: \(\beta_1 = 5\)
  • y ~ z: \(\beta_2 = 2\)
n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 2)
y <- z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)

Model of y ~ x. Ignoring the confounder results in a spurious relationship between intervention and outcome

fit <- lm(y ~ x, df)
summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1274 -1.4533  0.0301  1.5496  6.3088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.067120   0.153951   0.436    0.663    
## x           0.398542   0.002802 142.238   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.192 on 998 degrees of freedom
## Multiple R-squared:  0.953,  Adjusted R-squared:  0.9529 
## F-statistic: 2.023e+04 on 1 and 998 DF,  p-value: < 2.2e-16

Model controlling for z. This blocks the backdoor path between x and y and eliminates the apparent relationship

fit <- lm(y ~ x + z, df)
summary(fit)
## 
## Call:
## lm(formula = y ~ x + z, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9830 -1.3471 -0.0494  1.4390  5.8065 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02228    0.14122   0.158    0.875    
## x           -0.04192    0.03208  -1.307    0.192    
## z            2.20828    0.16032  13.774   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.01 on 997 degrees of freedom
## Multiple R-squared:  0.9605, Adjusted R-squared:  0.9604 
## F-statistic: 1.212e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Simple fork 2

This is the same graph as before, but now we include a real relationship between x and y

fork_dag <- dagify(x ~ z,
                   y ~ x + z,
                   coords = list(x = c(x = 1, y = 3, z = 2),
                                 y = c(x = 1, y = 1, z = 2)),
                   exposure = "x",
                   outcome = "y")

ggdag(fork_dag) +
  theme_dag()

Independence: the graph now implies no conditional independencies, so the call returns nothing

impliedConditionalIndependencies(fork_dag)

Adjustment set:

ggdag_adjustment_set(fork_dag) +
  theme_dag()

Equations

  • z: \(N(10, 5)\)
  • x ~ z: \(\beta_1 = 5\)
  • y ~ x + z: \(\beta_2 = -4\), \(\beta_3 = 2\)
n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 5)
y <- x * -4 + z * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z)
pairs(df)

Model of y ~ x. Without controlling for the confounder, the coefficient relating x and y is biased (the relationship is real, but its estimated magnitude is wrong)

fit <- lm(y ~ x, df)
summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6007  -3.8928  -0.1075   3.6774  16.6773 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.248996   0.373976    3.34  0.00087 ***
## x           -3.624424   0.006655 -544.61  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 998 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9966 
## F-statistic: 2.966e+05 on 1 and 998 DF,  p-value: < 2.2e-16

Model controlling for z. Now we recover the correct value for the coefficient (\(\approx -4\))

fit <- lm(y ~ x + z, df)
summary(fit)
## 
## Call:
## lm(formula = y ~ x + z, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2661  -3.6346   0.0608   3.5008  15.0201 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  0.61806    0.36302    1.703    0.089 .  
## x           -3.93228    0.03199 -122.903   <2e-16 ***
## z            1.60482    0.16346    9.818   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.076 on 997 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9969 
## F-statistic: 1.625e+05 on 2 and 997 DF,  p-value: < 2.2e-16

Two variable fork

This is to illustrate Pearl’s point about unnecessary confounders. This graph has a backdoor path with two confounding variables (z1 and z2). In order to estimate the causal link between x and y, we only need to control for one of them, as either will block the backdoor path

fork_dag <- dagify(x ~ z1,
                   z2 ~ z1,
                   y ~ z2,
                   coords = list(x = c(x = 1, y = 3, z1 = 1.5, z2 = 2.5),
                                 y = c(x = 1, y = 1, z1 = 2, z2 = 1.5)),
                   exposure = "x",
                   outcome = "y")

ggdag(fork_dag) +
  theme_dag()

Independence:

impliedConditionalIndependencies(fork_dag)
## x _||_ y | z2
## x _||_ y | z1
## x _||_ z2 | z1
## y _||_ z1 | z2

Adjustment set: Now we get two adjustment sets, as controlling for either z1 or z2 blocks the backdoor path

ggdag_adjustment_set(fork_dag) +
  theme_dag()

Equations

  • z1: \(N(10, 5)\)
  • x ~ z1: \(\beta_1 = 5\)
  • y ~ z2: \(\beta_2 = 2\)
  • z2 ~ z1: \(\beta_3 = -4\)
n <- 1000
z1 <- rnorm(n, 10, 5)
x <- z1 * 5 + rnorm(n, 0, 5)
z2 <- z1 * -4 + rnorm(n, 0, 5)
y <- z2 * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z1, z2)
pairs(df)

Model of y ~ x (shows spurious relationship)

fit <- lm(y ~ x, df)
summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.905  -8.940  -0.692   9.559  47.136 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.52003    0.94845  -2.657  0.00801 ** 
## x           -1.55071    0.01683 -92.140  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 998 degrees of freedom
## Multiple R-squared:  0.8948, Adjusted R-squared:  0.8947 
## F-statistic:  8490 on 1 and 998 DF,  p-value: < 2.2e-16

Control for z1

fit1 <- lm(y ~ x + z1, df)
summary(fit1)
## 
## Call:
## lm(formula = y ~ x + z1, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.411  -7.670   0.018   7.818  36.569 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.42923    0.79373   0.541    0.589    
## x           -0.01813    0.07197  -0.252    0.801    
## z1          -7.95320    0.36645 -21.703   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.36 on 997 degrees of freedom
## Multiple R-squared:  0.9286, Adjusted R-squared:  0.9284 
## F-statistic:  6480 on 2 and 997 DF,  p-value: < 2.2e-16

Control for z2

fit2 <- lm(y ~ x + z2, df)
summary(fit2)
## 
## Call:
## lm(formula = y ~ x + z2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8806  -3.5402   0.1354   3.3948  13.7955 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.179837   0.345638    0.52    0.603    
## x           -0.005798   0.019989   -0.29    0.772    
## z2           1.998835   0.024626   81.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.996 on 997 degrees of freedom
## Multiple R-squared:  0.9862, Adjusted R-squared:  0.9861 
## F-statistic: 3.556e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Control for both (this works, but the extra regressor is unnecessary and costs a degree of freedom)

fit3 <- lm(y ~ x + z1 + z2, df)
summary(fit3)
## 
## Call:
## lm(formula = y ~ x + z1 + z2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0223  -3.4918   0.1094   3.3704  13.7611 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.13739    0.34927   0.393    0.694    
## x           -0.02669    0.03166  -0.843    0.399    
## z1           0.17418    0.20469   0.851    0.395    
## z2           2.01523    0.03127  64.451   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.997 on 996 degrees of freedom
## Multiple R-squared:  0.9862, Adjusted R-squared:  0.9861 
## F-statistic: 2.37e+04 on 3 and 996 DF,  p-value: < 2.2e-16

Colliders

This is the second type of path junction that can cause problems. Here, our two variables (x and y) are independent, but both are linked to z, the collider. The issue with this relationship is the bias that results when we control for the collider
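A simulation in the same style as the fork examples above shows the effect (the coefficients 5 and 2 and the seed are arbitrary choices): x and y look unrelated until we condition on the collider, at which point a strong spurious association appears.

```r
# Collider: x -> z <- y; x and y are generated independently
set.seed(1234)
n <- 1000
x <- rnorm(n, 10, 5)
y <- rnorm(n, 10, 5)
z <- x * 5 + y * 2 + rnorm(n, 0, 2)  # collider

# x and y are independent, so this coefficient is near zero
coef(lm(y ~ x))["x"]

# Conditioning on the collider opens the path and induces
# a spurious (and strongly negative) x coefficient
coef(lm(y ~ x + z))["x"]
```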